TL;DR
Polynomial regression (degree 2, R² = 0.81 test) outperforms linear baseline (R² = 0.63 test) on 1,338 insurance records. Smokers pay an average of $23,615 vs $8,434 for non-smokers — 2.8× more. BMI is the second strongest continuous predictor. Southeast region shows highest average charges at $14,735.
R² = 0.81 (test set)
1,338 Records
+29% vs Linear Baseline
Smoking = 2.8× higher cost
Python
Machine Learning
Polynomial Regression
Feature Engineering
Cost Estimation
Project Overview
Healthcare costs are a significant concern for individuals and insurance companies alike. This project builds a predictive model for medical insurance charges based on customer demographics and health factors, with direct applicability to both individual financial planning and insurance policy pricing.
The 1,338-record dataset includes age, BMI, smoking status, region, sex, and number of children as predictors of annual insurance charges. Linear regression was benchmarked first (R² = 0.63), then polynomial regression (degree 2) was applied after residual plots confirmed non-linearity in the BMI-charge and age-charge relationships — achieving R² = 0.81 on the test set, a 29% relative improvement in explained variance.
Model Comparison
| Model | Train R² | Test R² | RMSE (Test) | Notes |
|---|---|---|---|---|
| Linear Regression | 0.65 | 0.63 | $6,012 | Underfits; residuals show non-linear pattern |
| Polynomial Regression (deg 2) | 0.84 | 0.81 | $4,847 | Best fit; captures BMI and age non-linearity |
| Polynomial Regression (deg 3) | 0.89 | 0.76 | $5,312 | Overfitting; train/test gap increases |
Degree 2 provided the best generalisation — degree 3 overfit (train R² = 0.89 but test R² fell to 0.76). The 0.03 train/test gap in degree 2 indicates good generalisation without overfitting.
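The degree comparison above can be sketched as follows. The insurance data is not reproduced here, so this uses a synthetic stand-in with a hypothetical generating process (a curved age effect and a smoker × BMI interaction); `make_synthetic` and `compare_degrees` are illustrative names, and the scores it prints will not match the table.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def make_synthetic(n=1338, seed=0):
    """Synthetic stand-in for the insurance data (NOT the real dataset)."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(18, 64, n)
    bmi = rng.normal(30, 6, n)
    smoker = rng.binomial(1, 0.2, n)
    # Hypothetical generating process: curved age effect plus smoker x BMI interaction.
    charges = (2000 + 3.0 * age**2 + 300 * bmi
               + 1400 * smoker * (bmi - 15) + rng.normal(0, 4000, n))
    return np.column_stack([age, bmi, smoker]), charges


def compare_degrees(X, y, degrees=(1, 2, 3), seed=42):
    """Return {degree: (train R2, test R2)} for polynomial fits on one fixed split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scores = {}
    for d in degrees:
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=d, include_bias=False),
            LinearRegression(),
        )
        model.fit(X_tr, y_tr)
        scores[d] = (r2_score(y_tr, model.predict(X_tr)),
                     r2_score(y_te, model.predict(X_te)))
    return scores


X, y = make_synthetic()
scores = compare_degrees(X, y)
for d, (train_r2, test_r2) in scores.items():
    print(f"degree {d}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Train R² can only rise as the degree grows, because each higher-degree model contains the lower one. The quantity to watch is the gap between the train and test columns.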
Key Insights
- Smoking is the dominant predictor: smokers pay an average of $23,615/year vs $8,434 for non-smokers — a 2.8× difference that dwarfs all other factors.
- BMI above 30 (obese classification) is associated with a $4,200 average premium increase, and the effect is non-linear — polynomial terms captured an accelerating cost increase at higher BMI values.
- Southeast region has the highest average charges at $14,735, compared to $13,406 (Northeast), $12,417 (Northwest), and $12,346 (Southwest). That is roughly 19% above the lowest region, a regional gap worth investigating for pricing strategy.
- Age-charge relationship is non-linear: charges increase more steeply after age 45, which the polynomial model captures — linear regression systematically under-predicted for older customers.
- Children count adds a small but statistically significant cost increment (~$475 per child on average).
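The headline ratios can be checked directly from the averages quoted in this write-up. The snippet below uses only those published figures; the variable names are illustrative.

```python
# Averages quoted in the write-up (annual charges, USD).
smoker_avg = 23_615
non_smoker_avg = 8_434
smoking_ratio = smoker_avg / non_smoker_avg

region_avg = {
    "Southeast": 14_735,
    "Northeast": 13_406,
    "Northwest": 12_417,
    "Southwest": 12_346,
}
highest = max(region_avg, key=region_avg.get)
lowest = min(region_avg, key=region_avg.get)
regional_gap = region_avg[highest] / region_avg[lowest] - 1

# Relative improvement in test R2 of the degree-2 model over the linear baseline.
r2_linear, r2_poly = 0.63, 0.81
relative_gain = r2_poly / r2_linear - 1

print(f"smoker / non-smoker ratio: {smoking_ratio:.2f}x")
print(f"{highest} vs {lowest}: {regional_gap:.1%} higher")
print(f"relative R2 gain: {relative_gain:.1%}")
```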
Technical Implementation
Data Preprocessing:
- No missing values in the dataset. Checked for outliers using IQR method — kept as they represent real high-cost patients.
- One-hot encoded: region (4 categories → 3 dummies), sex (binary). Label encoded: smoker (yes/no → 1/0).
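A minimal sketch of these preprocessing steps with pandas, under some assumptions: the rows are made up for illustration, the column names (age, sex, bmi, children, smoker, region, charges) are assumed rather than taken from the project code, and `iqr_outlier_mask` and `encode` are hypothetical helpers.

```python
import pandas as pd

# Toy rows, made up for illustration; column names are assumptions.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 61, 28, 40, 57],
    "sex": ["female", "male", "male", "female", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 34.5, 29.0, 25.3, 31.8, 36.2],
    "children": [0, 1, 2, 3, 0, 1, 2, 0],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "northwest", "northeast", "southwest"],
    "charges": [1700, 4400, 8200, 11000, 13500, 3900, 9800, 63000],
})


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def encode(frame):
    """Label-encode smoker (yes/no -> 1/0); one-hot encode region and sex."""
    out = frame.copy()
    out["smoker"] = out["smoker"].map({"yes": 1, "no": 0})
    # drop_first=True turns 4 regions into 3 dummies and sex into 1 column.
    return pd.get_dummies(out, columns=["region", "sex"], drop_first=True, dtype=int)


missing = int(df.isna().sum().sum())
flagged = iqr_outlier_mask(df["charges"])   # flagged rows are inspected, not dropped
encoded = encode(df)
print(missing, int(flagged.sum()), list(encoded.columns))
```

Dropping one dummy per categorical variable avoids a redundant column that is perfectly collinear with the intercept.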
Feature Engineering:
- Applied PolynomialFeatures(degree=2, interaction_only=False) from sklearn, which generates all polynomial and interaction terms from the original 6 features.
- Explored the smoker × BMI interaction term, which is statistically significant: obesity amplifies the smoking premium.
Model Evaluation:
- 80/20 train/test split with random state fixed for reproducibility.
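- 5-fold cross-validation gave R² = 0.81 ± 0.03, consistent with the held-out test score and stable across folds.
- RMSE of $4,847 on the test set: the typical prediction error is on the order of $5k. Because RMSE squares errors before averaging, a few large misses on high-cost patients weigh heavily in this figure.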
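The evaluation protocol can be sketched as below, again on synthetic stand-in data rather than the real records. Running cross-validation on the training portion only, the shuffled KFold, and the seed values are assumptions; the write-up does not state those details.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical generating process, not the real dataset).
rng = np.random.default_rng(0)
n = 1338
age = rng.uniform(18, 64, n)
bmi = rng.normal(30, 6, n)
smoker = rng.binomial(1, 0.2, n)
X = np.column_stack([age, bmi, smoker])
y = 2000 + 3.0 * age**2 + 300 * bmi + 1400 * smoker * (bmi - 15) + rng.normal(0, 4000, n)

# 80/20 split with a fixed random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())

# 5-fold cross-validation on the training portion only.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2")

model.fit(X_train, y_train)
pred = model.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
mae = float(mean_absolute_error(y_test, pred))

print(f"CV R2: {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
print(f"test RMSE: {rmse:,.0f}  test MAE: {mae:,.0f}")
```

Reporting MAE next to RMSE is a cheap way to see how much of the error comes from a few large misses, since RMSE is never smaller than MAE.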
Key Learnings
- Residual plots are essential before choosing model complexity — the non-random pattern in linear regression residuals (funnel shape with BMI) directly indicated the need for polynomial terms. This is how polynomial regression should always be motivated, not by trial and error.
- Overfitting risk is real even with 6 features: a degree-3 expansion of 6 inputs yields 83 polynomial terms (excluding the constant) on a 1,338-row dataset, and the train/test gap widened from 0.03 to 0.13. Regularisation (Ridge/Lasso) would be the next step before degree 3.
- Interaction terms reveal business insights — the smoker × BMI interaction showing that obesity compounds the smoking cost increase is not just a statistical artefact; it's actionable for underwriting teams.
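The residual check described in the first point can also be done numerically, without a plot. The sketch below is an illustration on synthetic data with a deliberately curved age effect (an assumption, not the project's data): it fits a straight line and averages the residuals within quantile bins of the predictor. `binned_mean_residuals` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: a convex relationship that a straight line cannot follow.
rng = np.random.default_rng(1)
age = rng.uniform(18, 64, 1000)
charges = 3.0 * age**2 + rng.normal(0, 500, 1000)

linear = LinearRegression().fit(age.reshape(-1, 1), charges)
residuals = charges - linear.predict(age.reshape(-1, 1))


def binned_mean_residuals(x, resid, n_bins=5):
    """Mean residual within each quantile bin of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return np.array([resid[idx == b].mean() for b in range(n_bins)])


means = binned_mean_residuals(age, residuals)
print(np.round(means))
```

If the linear model were adequate, the bin means would scatter around zero. A positive, negative, positive pattern across the bins means the line under-predicts at both ends, which is the signal that a squared term is worth adding.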
Future Work
- Apply Ridge regression to degree-3 polynomial features — regularisation might allow higher-order terms without overfitting, potentially improving the test R².
- Evaluate gradient boosting models (XGBoost) as a non-parametric baseline — they may naturally capture the non-linearities without requiring explicit polynomial feature creation.
- Add confidence intervals on predictions — for insurance pricing, a point estimate without uncertainty quantification is insufficient for actuarial use.
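The first item could be prototyped along these lines. This is a sketch under assumptions, not a result: it uses synthetic stand-in data, scikit-learn's RidgeCV with an arbitrary alpha grid, and standardises the expanded features so the penalty treats them on a comparable scale.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical generating process, not the real dataset).
rng = np.random.default_rng(0)
n = 1338
age = rng.uniform(18, 64, n)
bmi = rng.normal(30, 6, n)
smoker = rng.binomial(1, 0.2, n)
X = np.column_stack([age, bmi, smoker])
y = 2000 + 3.0 * age**2 + 300 * bmi + 1400 * smoker * (bmi - 15) + rng.normal(0, 4000, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

alphas = np.logspace(-3, 3, 13)  # arbitrary grid, chosen for illustration
ridge_deg3 = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),          # put expanded terms on a comparable scale
    RidgeCV(alphas=alphas),    # picks alpha by built-in cross-validation
)
ridge_deg3.fit(X_train, y_train)

chosen_alpha = float(ridge_deg3.named_steps["ridgecv"].alpha_)
train_r2 = r2_score(y_train, ridge_deg3.predict(X_train))
test_r2 = r2_score(y_test, ridge_deg3.predict(X_test))
print(f"alpha = {chosen_alpha:g}, train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

Whether this closes the degree-3 train/test gap on the real data is an open question; the sketch only shows the mechanics.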
Built by Om Patel — ML Engineer & Data Scientist.
Explore more projects on my Portfolio.